An Improved Random Forest Classifier for Text Categorization

Authors

  • Baoxun Xu
  • Xiufeng Guo
  • Yunming Ye
  • Jiefeng Cheng
Abstract

This paper proposes an improved random forest algorithm for classifying text data. The algorithm is designed for very high-dimensional data with multiple classes, of which text corpora are a well-known example. A novel feature weighting method and a novel tree selection method are developed and used together to make the random forest framework well suited to categorizing text documents with dozens of topics. With the new feature weighting method for subspace sampling and the tree selection method, we can effectively reduce the subspace size and improve classification performance without increasing the error bound. We apply the proposed method to six text data sets with diverse characteristics. The results demonstrate that the improved random forest outperforms popular text classification methods in terms of classification performance.

Index Terms: random forest, text categorization, random subspace, decision tree
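The abstract names two ingredients, weighted subspace sampling and tree selection, without giving their formulas. The sketch below is only an illustration of the general idea, not the authors' method. It assumes each feature already has a positive relevance weight and each tree already has a held-out accuracy score. The function names `weighted_subspace` and `select_trees` and all numbers are hypothetical.

```python
import random

def weighted_subspace(weights, size, rng):
    """Draw `size` distinct feature indices, favoring larger weights.

    Assumption: weights are positive relevance scores computed elsewhere;
    the abstract does not say how the paper derives them.
    """
    candidates = list(range(len(weights)))
    pool = list(weights)
    chosen = []
    for _ in range(size):
        # Pick one remaining feature in proportion to its weight,
        # then remove it so it cannot be drawn twice.
        pick = rng.choices(range(len(candidates)), weights=pool, k=1)[0]
        chosen.append(candidates.pop(pick))
        pool.pop(pick)
    return chosen

def select_trees(tree_scores, keep):
    """Return the ids of the `keep` trees with the highest score.

    Assumption: the score is a held-out accuracy; the paper's actual
    selection criterion is not stated in the abstract.
    """
    ranked = sorted(tree_scores, key=lambda pair: pair[1], reverse=True)
    return [tree_id for tree_id, _ in ranked[:keep]]

rng = random.Random(0)
feature_weights = [0.05, 0.9, 0.02, 0.6, 0.01, 0.4]  # hypothetical weights
subspace = weighted_subspace(feature_weights, size=3, rng=rng)

tree_scores = [("t0", 0.71), ("t1", 0.84), ("t2", 0.65), ("t3", 0.80)]  # hypothetical
best = select_trees(tree_scores, keep=2)
print(subspace, best)
```

Because informative features are drawn more often, a smaller subspace can still contain useful terms at each split, which is the intuition behind reducing subspace size without hurting accuracy.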


Similar resources

Semi-Supervised Learning Based Prediction of Musculoskeletal Disorder Risk

This study explores a semi-supervised classification approach that uses random forest as a base classifier to classify the risk of low-back disorders (LBDs) associated with industrial jobs. The semi-supervised approach uses unlabeled data together with a small number of labelled data to create a better classifier. The results obtained by the proposed approach are compared with those o...


Text categorization on Reuters corpus

1 Task. The task of text categorization can be described as follows: given a set of documents, we want to assign to each document one or more text categories, or no category. In this term project, we want to categorize documents from the well-known Reuters-21578 corpus, a collection of 21578 articles published on Reuters in 1987. We have chosen only the three most frequent text categories as the...


Automatic Categorization of Fanatic Texts

This paper presents the task of automatic categorization of fanatic texts. The analyzed set of texts stems from an Arabic environment in Kuwait, where teachers and students were asked questions regarding various terrorist tendencies. The responses were classified by a domain expert into one of three classes according to the degree of fanaticism of their content. The main task was to develop an aut...


ForesTexter: An efficient random forest algorithm for imbalanced text categorization

In this paper, we propose a new Random Forest (RF) based ensemble method, ForesTexter, to solve imbalanced text categorization problems. RF has shown great success in many real-world applications. However, learning from text data with class imbalance is a relatively new challenge that needs to be addressed. An RF algorithm tends to use a simple random sampling of features in b...


An Improved Algorithm of Bayesian Text Categorization

Text categorization is a fundamental methodology of text mining and has been a hot topic in data mining and web mining research in recent years. It plays an important role in building traditional information retrieval, web indexing architecture, Web information retrieval, and so on. This paper presents an improved algorithm of text categorization that combines the feature weighting technique with...



Journal title:
  • JCP

Volume 7  Issue

Pages  -

Publication date 2012